ENV221 NOTE

R in statistics (Fundamental)

Author

TC-tea

Published

2023.05.08

🚀

Hyperlink of XJTLU ENV221 courseware

👉🏻Click to enter the ENV222 note section

1 graph part & R basic code

1.1 Scatterplot

library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Sepal.Width)) + 
  geom_point(aes(x = Sepal.Length, y = Sepal.Width, color = Species)) + 
  geom_smooth(method = "lm") + facet_wrap(~ Species)

1.2 Barplot

library(ggplot2)
ggplot(iris) + geom_bar(aes(Species), width = 0.5) + coord_flip()

1.3 Histogram

library(ggplot2)
ggplot(iris) + geom_histogram(aes(Sepal.Length))

1.4 Boxplot

# R built-in way
River_A <- c(2, 1.95, 1.99, 2.08, 1.99, 1.95, 2.03, 2.09, 2.07, 2.01)
River_B <- c(1.85, 1.94, 1.87, 1.89, 1.91, 1.93, 2.01, 2, 1.96, 1.98)
boxplot(River_A, River_B, names = c('River A','River B'))

# ggplot2 way
library(ggplot2)
ggplot(iris) + geom_boxplot(aes(Sepal.Length))

1.5 Pairplot

# install.packages("GGally")
library(GGally)
ggpairs(iris, aes(colour=Species, alpha=0.5))

ggsave("df_plot.pdf")   # Save graphs

1.6 R basic operations

# Random sampling
df <- data.frame(x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
                 y = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i' ,'j'))   
sample(df$y[df$x > 5], 2)

[1] "g" "i"

sample(df$y[df$x <= 5], 2)

[1] "e" "a"

1.7 Two ways to remove NA & outlier from data set

x=c(1,2,NA,4)
mean(x, na.rm = TRUE)   # Remove NA within a function

[1] 2.333333

x <- x[!is.na(x)]   # Adjust dataset (multidimensional data filtering) selected an undefined column. Need to add ', !is.na(***)' after it.
x <- x[x != boxplot(x, plot=FALSE)$out]   # Excluded outliers from the dataset

1.8 Correlation coefficient function

cor_coe <- function(x, y){
  N <- length(x)
  co <- 1 / (N - 1) * sum((y - mean(y)) * (x - mean(x)))
  sx <- sd(x)
  sy <- sd(y)
  co / (sx * sy)
}

2 Statistic part

H0：Usually, the researcher wants to collect evidence to oppose one side, also known as the null hypothesis (generally including equal to.
H1：Usually, it is the side that researchers collect evidence to support, also known as the alternative hypothesis (generally including not equal to).
Type 1 error: False positive (the hypothesis is right but we reject it) [alpha]
Type 2 error: False negative (the hypothesis is wrong but we didn’t reject it) [beta]
Alpha(Significance level): The maximum allowable probability of making a type I error [0.10, 0.05, 0.01 …]
Beta: …
P-value(Probability value): if P-value <= alpha, reject H0; if P-value > alpha, not reject H0.
Rejection region, critical region: the range of values for which H0 is not probable(The interval where H0 cannot occur.)

2.1 Binomial distribution

e.g. Toss a coin for “n” times, “x” is the frequency of head

x <- 2
n <- 2
p <- 0.5
dbinom(x, size = n, prob = p)  # Same as choose(n, x) * p^x * (1-p)^(n-x)

[1] 0.25

2.2 Z-test

Known population mean and population variance

x <- c(24, 36, 44, 35, 44, 34, 29, 40, 39, 43, 41, 32, 33, 29, 29, 43, 25, 39, 25, 42, 
       29, 22, 22, 25, 14, 15, 14, 29, 25, 27, 22, 24, 18, 17)
mu <- 29
sd_p <- 9
x_bar <- mean(x)
se <- sd_p / sqrt(length(x))
(z_score <- (x_bar - mu) / se)

[1] 0.4382742

(z_critical <- qnorm(1 - 0.05))

[1] 1.644854

(p_value <- pnorm(z_score))

[1] 0.6694062

2.3 T-test

Known population mean but unknown population variance

mu <- 100
x <- c(91, 100, 70, 87, 104, 92, 104, 88, 72, 119)
t.test(x, mu = mu , alternative = "less", conf.level = 0.05)   # One-tailed


    One Sample t-test

data:  x
t = -1.5478, df = 9, p-value = 0.07804
alternative hypothesis: true mean is less than 100
5 percent confidence interval:
     -Inf 84.05409
sample estimates:
mean of x 
     92.7

mu <- 0.02
x <- c(0.011, 0.021, 0.001, 0.007, 0.031, 0.023, 0.026, 0.019)
t.test(x, mu = 0.02, alternative = "two.sided", conf.level = 0.1)   # Two-tailed


    One Sample t-test

data:  x
t = -0.73012, df = 7, p-value = 0.489
alternative hypothesis: true mean is not equal to 0.02
10 percent confidence interval:
 0.01690656 0.01784344
sample estimates:
mean of x 
 0.017375

x1 <- c(24.58, 22.09, 23.70, 18.89, 22.02, 28.71, 24.44, 20.91, 23.83, 20.83)
x2 <- c(21.61, 19.06, 20.72, 15.77, 19, 25.88, 21.48, 17.85, 20.86, 17.77)
t.test(x1, x2, var.equal = TRUE, alternative = "two.sided", mu = 0)   # Two samples


    Two Sample t-test

data:  x1 and x2
t = 2.4396, df = 18, p-value = 0.02528
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.4164695 5.5835305
sample estimates:
mean of x mean of y 
       23        20

2.4 F-test

The significance of testing the difference between two sets of values, where mu1 = mu2 = mu3 = mu4…

x1 <- c(24.58, 22.09, 23.70, 18.89, 22.02, 28.71, 24.44, 20.91, 23.83, 20.83)
x2 <- c(21.61, 19.06, 20.72, 15.77, 19, 25.88, 21.48, 17.85, 20.86, 17.77)
var.test(x2, x1, ratio = 1, alternative = "two.sided", conf.level = 0.95)   # Two set of data "x1, x2" ANOVA calculation


    F test to compare two variances

data:  x2 and x1
F = 1.0586, num df = 9, denom df = 9, p-value = 0.9338
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.2629492 4.2620464
sample estimates:
ratio of variances 
          1.058632

dtf <- data.frame(before = c(198, 201, 210, 185, 204, 156, 167, 197, 220, 186),
                  one = c(194, 203, 200, 183, 200, 153, 166, 197, 215, 184),
                  two = c(191, 200, 192, 180, 195, 150, 167, 195, 209, 179),
                  three = c(188, 196, 188, 178, 191, 145, 166, 192, 205, 175))
rownames(dtf) <- LETTERS[1:10]
dtf2 <- stack(dtf)
names(dtf2) <- c("w", "level")
w_aov <- aov(w ~ level, data = dtf2)
summary(w_aov)

            Df Sum Sq Mean Sq F value Pr(>F)
level        3    569   189.7   0.577  0.634
Residuals   36  11841   328.9

dtf2$subject <- rep(LETTERS[1:10], 4)
w_aov2 <- aov(w ~ level + Error(subject/level), data = dtf2)
summary(w_aov2)    # One-way ANOVA, judging significance based on the '*' symbol after Pr(>F)


Error: subject
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals  9  11632    1292               

Error: subject:level
          Df Sum Sq Mean Sq F value  Pr(>F)    
level      3  569.1  189.69   24.48 7.3e-08 ***
Residuals 27  209.2    7.75                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

dtf <- data.frame(w = c(90,95,100,75,78,90,120,125,130,100,118,112,125,130,135,118,125,132),
                  diet = rep(c("Diet1", "Diet2", "Diet3"), each = 6),
                  gender = rep(c("Male", "Female"), each = 3))
aov_wg <- aov(w ~ diet * gender, data = dtf)
summary(aov_wg)    # Two-way ANOVA, judging significance based on the '*' symbol after Pr(>F)

            Df Sum Sq Mean Sq F value   Pr(>F)    
diet         2   5061  2530.5  56.026 8.19e-07 ***
gender       1    578   578.0  12.797   0.0038 ** 
diet:gender  2     91    45.5   1.007   0.3941    
Residuals   12    542    45.2                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

2.5 X^2-test

Check if there is a correlation between A and B

dtf <- data.frame(colour = c("red", "white", "pink"),   # One-category X^2 test
                  observed = c(72, 63, 125))   # H0:The ratio of red to white to pink is 1:1:2
dtf$expected <- sum(dtf$observed) * c(1, 1, 2) / 4
df <- nrow(dtf) - 1
(chi_sq_score <- sum((dtf$observed - dtf$expected)^2 / dtf$expected))

[1] 1.007692

(critical_point <- qchisq(0.05, df, lower.tail = FALSE))

[1] 5.991465

(p_value <- pchisq(chi_sq_score, df, lower.tail = FALSE))

[1] 0.6042023

dtf <- data.frame(Endangered = c(162, 143, 38, 17, 45),   # Multiple-category X^2 test
                  Threatened = c( 18,  16, 19, 10, 32), 
                  row.names = c('Mammals', 'Birds', 'Reptiles', 'Amphibians', 'Fish'))
chisq.test(dtf)   # Yates’ correction


    Pearson's Chi-squared test

data:  dtf
X-squared = 56.503, df = 4, p-value = 1.573e-11

2.6 Regression & Correlation [Lecture 10]

# Hypothesis test
dtf0 <- read.csv("data/students_env221.csv")
dtf <- dtf0[dtf0$SHOE != boxplot(dtf0$SHOE, plot=FALSE)$out & !is.na(dtf0$SHOE), ]    # Excluded NA and outliers.
env_lm <- lm(SHOE ~ HEIGHT, data = dtf)
env_lm$coefficients       # Intercept & Slope

(Intercept)      HEIGHT 
 -8.6016669   0.2835591

# Hypothesis test about correlation (continue)
(r <- cor(dtf$HEIGHT, dtf$SHOE))    # correlation coefficient

[1] 0.8793958

r ^ 2       # coefficient of determination. When R^2 is closer to 1, it means that the related equation is more referential; The closer to 0, the less referential.

[1] 0.7733371

n <- nrow(dtf)
(t_score <- r * sqrt((n-2) / (1 - r ^ 2)))

[1] 13.44721

(t_critical <- qt(0.975, df = n - 2))

[1] 2.005746

(p_value <- pt(t_score, df = n - 2, lower.tail = FALSE) * 2)

[1] 1.020629e-18

summary(env_lm)    # Quick method, where R-squared represents the correlation coefficient.


Call:
lm(formula = SHOE ~ HEIGHT, data = dtf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7527 -0.8123 -0.0363  1.0535  3.4117 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.60167    3.58524  -2.399     0.02 *  
HEIGHT       0.28356    0.02109  13.447   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.467 on 53 degrees of freedom
Multiple R-squared:  0.7733,    Adjusted R-squared:  0.7691 
F-statistic: 180.8 on 1 and 53 DF,  p-value: < 2.2e-16

# Hypothesis test about slope (continue)
summary(env_lm)    # Quick method, where the HEIGHT column represents the slope and includes t-scores and p-values.


Call:
lm(formula = SHOE ~ HEIGHT, data = dtf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7527 -0.8123 -0.0363  1.0535  3.4117 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.60167    3.58524  -2.399     0.02 *  
HEIGHT       0.28356    0.02109  13.447   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.467 on 53 degrees of freedom
Multiple R-squared:  0.7733,    Adjusted R-squared:  0.7691 
F-statistic: 180.8 on 1 and 53 DF,  p-value: < 2.2e-16

(t_critical <- qt(0.975, df = n - 2))

[1] 2.005746

# Hypothesis test about intercept (continue)
summary(env_lm)     # Quick method, the intercept column is both the slope, including t-scores and p-values.


Call:
lm(formula = SHOE ~ HEIGHT, data = dtf)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7527 -0.8123 -0.0363  1.0535  3.4117 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.60167    3.58524  -2.399     0.02 *  
HEIGHT       0.28356    0.02109  13.447   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.467 on 53 degrees of freedom
Multiple R-squared:  0.7733,    Adjusted R-squared:  0.7691 
F-statistic: 180.8 on 1 and 53 DF,  p-value: < 2.2e-16

(t_critical <- qt(0.025, df = n - 1))

[1] -2.004879

# Visualization (another example)
X <- c(4.69, 4.43, 4.04, 4.06, 3.89, 3.93, 3.49, 3.4, 2.97, 2.47, 2.22, 2.13)
Y <- c(15.6, 15.4, 14.9, 14.5, 13.5, 13.4, 12.7, 12.3, 11.4, 10.5, 10.2, 10.0)
dtf <- data.frame(X, Y)
XY_lm <- lm(Y ~ X, data = dtf)
plot(x = dtf$X, y = dtf$Y, xlab = "X(unit)", ylab = "Y(unit)", las = 1, pch = 16)
abline(XY_lm, col = "blue")
mean_X <- mean(dtf$X, na.rm = TRUE)
mean_Y <- mean(dtf$Y, na.rm = TRUE)
points(x = mean_X, y = mean_Y, col = "red", cex = 3)
abline(h = mean_Y, v = mean_X, col = "red", lty = 2)

SessionInfo:

R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8 
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] GGally_2.1.2  ggplot2_3.4.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.10        plyr_1.8.8         RColorBrewer_1.1-3 pillar_1.9.0      
 [5] compiler_4.2.3     tools_4.2.3        digest_0.6.31      jsonlite_1.8.4    
 [9] evaluate_0.20      lifecycle_1.0.3    tibble_3.2.1       gtable_0.3.3      
[13] nlme_3.1-162       lattice_0.20-45    mgcv_1.8-42        pkgconfig_2.0.3   
[17] rlang_1.1.0        Matrix_1.5-3       cli_3.6.1          rstudioapi_0.14   
[21] yaml_2.3.7         xfun_0.37          fastmap_1.1.0      withr_2.5.0       
[25] dplyr_1.1.0        knitr_1.42         generics_0.1.3     vctrs_0.6.2       
[29] htmlwidgets_1.6.2  grid_4.2.3         tidyselect_1.2.0   reshape_0.8.9     
[33] glue_1.6.2         R6_2.5.1           fansi_1.0.3        rmarkdown_2.21    
[37] farver_2.1.1       magrittr_2.0.3     scales_1.2.1       htmltools_0.5.4   
[41] splines_4.2.3      colorspace_2.1-0   labeling_0.4.2     utf8_1.2.3        
[45] munsell_0.5.0